So I watched a presentation by Thomas Wiecki and stole his example and code almost verbatim.

Talk: http://vimeo.com/79518830

Website: https://sites.google.com/a/brown.edu/lncc/home/members/thomas-wiecki

He gave the best intro to Bayesian stats with PyMC I've ever seen. In this post I'm going to pick apart his example and really work to explain some of the details. I'll also be playing with his example in a less advanced way than he did to show how flexible PyMC3 is.

As I walk through the experiments I'm trying to show the way I explored the library and fiddled my way to working code. So there will be some redundant code throughout this. It's meant for clarity.

Generating data

To start I need to generate some noisy data:


In [80]:
import numpy

# x is just 0..49; y is x plus uniform noise on [0, 40].
# On average the noise adds 20, so we should expect a fitted
# intercept near 20 and a slope near 1.
x = numpy.array(range(0, 50))
y = x + numpy.random.uniform(low=0.0, high=40.0, size=len(x))

In [81]:
import matplotlib.pyplot as plt
plt.scatter(x,y)


Out[81]:
<matplotlib.collections.PathCollection at 0x116d10b90>

Great. We have data!

Linear regression

Now I copy the code from Thomas's talk:


In [82]:
import pymc as pm
import numpy as np

trace = None
with pm.Model() as model:
    # priors for the intercept, slope, and noise standard deviation
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)

    # the deterministic part of the model: a straight line
    y_est = alpha + beta * x

    # likelihood of the observed y values given the line and the noise level
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)

    # find the MAP estimate as a starting point, then sample with NUTS
    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(2000, step, start=start, progressbar=False)

    pm.traceplot(trace);


The distributions above show the likely values for alpha and beta. Just to clarify, there's one pair of charts for each of my random variables. On the left chart, the x-axis represents possible hypotheses for the value of the variable and the y-axis represents the likelihood of that hypothesis. Taller is more likely.

The chart on the right shows the value the sampler drew for that variable at each iteration of the trace. I've noticed the values tend to fall evenly around the most likely hypotheses for each variable, and I've also seen examples that look less random and more like straight lines when I'm doing something crazy. So my working rule is: if the charts on the right don't look like a fuzzy mess centered on a value, something is probably wrong. That's about all I know right now.

Visually evaluating the graphs it seems alpha falls somewhere around 20 and beta has the most likely single value at .96 (though it's probably really just 1). Plotting the regression line using the above estimates...


In [88]:
import matplotlib.pyplot as plt
import numpy as np 
def graph(formula, x_range):
    # eval() the formula string with the local x in scope, then plot it
    x = np.array(x_range)
    y = eval(formula)
    plt.plot(x, y)
    
plt.scatter(x,y)

graph('20 + .95*x', range(0,50))

plt.show()


This excites me to no end. We've gotten to what appears to be a pretty damn good fit for the regression line completely probabilistically.
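(Quick aside: instead of just eyeballing the traceplot, we can pull the sampled values back out of the trace and average them. Here's a rough sketch that reuses the trace.point() accessor I lean on below; newer versions of PyMC expose the samples differently, so treat this as illustrative.)

import numpy as np

# average the sampled values for each variable to get point estimates
alphas = np.array([trace.point(i)['alpha'] for i in xrange(0, 2000)])
betas = np.array([trace.point(i)['beta'] for i in xrange(0, 2000)])

print(np.mean(alphas))  # posterior mean for the intercept -- should land near 20
print(np.mean(betas))   # posterior mean for the slope -- should land near 1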

But that's a fairly definitive answer that doesn't speak to our uncertainty. Let's check out the Bayesian view of the regression. In Bayesian statistics a linear regression isn't a single best-fit line; it's a distribution over lines. To visualize it, we plot one line for each sample in the trace. We should find that most of the lines bunch up around what frequentist statistics would call the true regression line.


In [89]:
import matplotlib.pyplot as plt
import numpy as np 
def graph(formula, x_range):  
    x = np.array(x_range)  
    y = eval(formula)
    plt.plot(x, y)
    
plt.scatter(x,y)

for i in xrange(0,2000):
    point = trace.point(i)
    graph('{0} + {1}*x'.format(point['alpha'], point['beta']), range(0,50))

plt.show()


So that's a mess. What we can see, though, is that the lines are most tightly constrained vertically in the middle of the chart. Can we make the lines transparent so that overlapping lines compound, effectively weighting the chart by how likely each region is?


In [123]:
import matplotlib.pyplot as plt
import numpy as np 
def graph(formula, x_range, color='black', alpha=1):  
    x = np.array(x_range)  
    y = eval(formula)
    plt.plot(x, y, color=color, alpha=alpha)
    
plt.scatter(x,y)

for i in xrange(0,2000):
    point = trace.point(i)
    graph('{0} + {1}*x'.format(point['alpha'], point['beta']), range(0,50), color='black', alpha=.0098035)

plt.show()


Excellent! Now we can visualize the certainty as the darker area in the chart: the darker an area, the more likely the true regression line passes through it. Or, to be extremely Bayesian about it, we can read it as a picture of the likely values of y given x.

Given this how can we interpret the chart?

The middle area has the most certainty about it. It's not completely certain of course but a quick visual guess shows that at the most certain point (roughly 25 on the x-axis) the value of our regression line will likely fall somewhere around 40 +/- 3 or so?

As we move to the outskirts of our data we also see that the line gets blurrier and blurrier, which means we have more and more uncertainty about exactly which line fits. Thinking it through, that seems reasonable to me. After all, the data at the edges of our domain tells us very little about what lies beyond.
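We can also put a rough number on that visual guess by evaluating every sampled line at x = 25 and looking at the spread. Again, just a sketch built on trace.point():

import numpy as np

# evaluate each sampled regression line at x = 25
alphas = np.array([trace.point(i)['alpha'] for i in xrange(0, 2000)])
betas = np.array([trace.point(i)['beta'] for i in xrange(0, 2000)])
line_at_25 = alphas + betas * 25

print(np.mean(line_at_25))                     # central estimate for the line at x = 25
print(np.percentile(line_at_25, [2.5, 97.5]))  # a rough 95% band -- the "+/-" we eyeballed

(Note this is the uncertainty in the line itself; the spread of individual y values around the line would be wider, since it also includes sigma.)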

Now let's plot our best guess of the true line through this and see how it matches up.


In [124]:
plt.scatter(x,y)

for i in xrange(0,2000):
    point = trace.point(i)
    graph('{0} + {1}*x'.format(point['alpha'], point['beta']), range(0,50), color='black', alpha=.0098035)

graph('20 + .95*x', range(0,50), color='red', alpha=1.0)
plt.show()


Not too shabby! Again it's not perfect. This whole thing is intensely fuzzy, but it definitely seems to fall in our most likely area at least.

Quadratic regression

Now I'm really curious: is it just as easy to fit a curve here instead of a line? One of the touted benefits of Bayesian statistics is the flexibility of the methods, so let's rework the model to try that. Oh! And we'll be fitting on the same data. Obviously a line is the best fit in this situation, but I don't care. I just want to see what PyMC comes up with.

So here's what I'm thinking when I set this up. I don't completely understand how all of this really works, BUT I can see an obvious place where we seem to be specifying the general form of the equation to fit to the data: the y_est line. So I'm going to add an extra quadratic term to the example above with its own random variable, 'mewmew'. Makes just as much sense as any Greek letter, I figure.


In [125]:
import pymc as pm
import numpy as np

trace = None
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    mewmew = pm.Normal('mewmew', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)
    
    # same line as before, plus a quadratic term with its own coefficient
    y_est = alpha + beta * x + mewmew * x * x
    
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    
    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(2000, step, start=start, progressbar=False)
    
    pm.traceplot(trace);


OMG. Did it work? Let's run through the same graphing exercise we did above for the linear fit.


In [127]:
plt.scatter(x,y)

for i in xrange(0,2000):
    point = trace.point(i)
    graph('{0} + {1}*x + {2}*x*x'.format(point['alpha'], point['beta'], point['mewmew']), range(0,50), color='black', alpha=.0098035)

graph('14 + 1.75*x - .015*x*x', range(0,50), color='red', alpha=1.0)
plt.show()


BOOM!!!! The left and right sides of the graph have a fuzzier look, representing our increased uncertainty as we move away from the middle. What's really interesting, too, is that mewmew has a low likelihood of being zero. It seems that little bit of curve fits this data very well.

Of course, that curve is really overfitting. Since we know what the data was actually based on, we know a straight line is more predictive.
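Just to check that read on mewmew, here's a sketch that pulls its samples out of the trace and looks at where the bulk of the mass sits:

import numpy as np

# collect the sampled values of the quadratic coefficient
mewmews = np.array([trace.point(i)['mewmew'] for i in xrange(0, 2000)])

print(np.mean(mewmews))                     # posterior mean of the quadratic term
print(np.percentile(mewmews, [2.5, 97.5]))  # if this interval excludes 0, the model really wants some curve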

Sinusoidal regression (Shit hits the fan)

How about a Bayesian sinusoidal regression?? :D


In [135]:
import pymc as pm
import numpy as np

trace = None
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)
    
    y_est = alpha + beta * math.sin(x)
    
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    
    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(2000, step, start=start, progressbar=False)
    
    pm.traceplot(trace);


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-135-b25cbaf26a5c> in <module>()
      8     sigma = pm.Uniform('sigma', lower=0, upper=20)
      9 
---> 10     y_est = alpha + beta * math.sin(x)
     11 
     12     likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)

TypeError: only length-1 arrays can be converted to Python scalars

Nooo! Wait, I know: math.sin only takes a single number, and x is a whole array. Let's use the numpy sin function instead.


In [136]:
import pymc as pm
import numpy as np

trace = None
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)
    
    y_est = alpha + beta * numpy.sin(x)
    
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    
    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(2000, step, start=start, progressbar=False)
    
    pm.traceplot(trace);



In [138]:
plt.scatter(x,y)

for i in xrange(0,2000):
    point = trace.point(i)
    graph('{0} + {1}*numpy.sin(x)'.format(point['alpha'], point['beta']), range(0,50), color='black', alpha=.0098035)

graph('43 - 4*numpy.sin(x)', range(0,50), color='red', alpha=1.0)
plt.show()


BWAHAHA! But whoops, I forgot to include the inner term that lets the algorithm adjust the frequency (not just the amplitude). Let's bring back mewmew!


In [174]:
import pymc as pm
import numpy as np
import theano as th

trace = None
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    mewmew = pm.Normal('mewmew', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)
    
    y_est = alpha + beta * pm.sin(mewmew * x)
    
    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)
    
    start = pm.find_MAP()
    step = pm.NUTS(state=start)
    trace = pm.sample(2000, step, start=start, progressbar=False)

In [175]:
pm.traceplot(trace);


Oh god, that was painful to get to. You can see I used pm.sin above for my sine function, and you might be like, "Whoa! Why not just use numpy.sin?" I actually started out trying math.sin; that failed, so I realized DUH, I'm working with vectors, and tried numpy.sin. But then I got some other crazy error and realized maybe PyMC has a version compatible with its random variables built in... and VOILA!
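If I understand it right, pm.sin is just the Theano sine op re-exported: the random variables (and anything built from them, like mewmew * x) are symbolic Theano expressions rather than plain arrays, which is why numpy.sin chokes on them. So I believe this sketch would be equivalent, though I haven't run it here:

import theano.tensor as T

with pm.Model() as model_alt:
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    mewmew = pm.Normal('mewmew', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)

    # T.sin works on the symbolic expression mewmew * x, which numpy.sin can't handle
    y_est = alpha + beta * T.sin(mewmew * x)

    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)

Anyway, back to the trace we actually sampled: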


In [187]:
plt.scatter(x,y)

for i in xrange(0,2000):
    point = trace.point(i)
    graph('{0} + {1}*numpy.sin({2} * x)'.format(point['alpha'], point['beta'], point['mewmew']), range(0,50), color='black', alpha=.0098035)

graph('44 - 15*numpy.sin(-41*x)', range(0,50), color='red', alpha=1.0)
plt.show()


One thing I notice here is that my most likely line doesn't really seem to match up with the shaded area at all... Also, the red line doesn't look like anything a plain sine function could produce; the amplitude should be constant.

The second part I know how to fix: I'm only plotting 50 points when I graph the ideal line, which badly aliases a sine wave this fast. Here's a better version using 200 points:


In [199]:
plt.scatter(x,y)

for i in xrange(0,2000):
    point = trace.point(i)
    graph('{0} + {1}*numpy.sin({2} * x)'.format(point['alpha'], point['beta'], point['mewmew']), range(0,50), color='black', alpha=.0098035)

graph('44 - 15*numpy.sin(-41*x)', numpy.linspace(0,50, 200), color='red', alpha=1.0)
plt.show()


Hehehe... so that's interesting. A great way to make a sine function "fit" all the data is to have it oscillate so fast that it just fills a band covering the points. The shaded area looks more like the line I was expecting, though. Actually, you can't easily see the shaded area... let me highlight it better:


In [198]:
plt.scatter(x,y)

for i in xrange(0,2000):
    point = trace.point(i)
    graph('{0} + {1}*numpy.sin({2} * x)'.format(point['alpha'], point['beta'], point['mewmew']), numpy.linspace(0,50,400), color='black', alpha=.0098035)

plt.show()

On top of that, the ghost of the line I was more expecting is gone; it seems like it was just an interference pattern from the sparse plotting, perhaps. Either way, one thing I notice is that if I multiplied the sine function by x, it could probably fit this perfectly (and terribly at the same time ;)
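If I ever wanted to try that, I think the model change would be a one-liner. Something like this (completely untested, and almost certainly terrible for prediction):

with pm.Model() as model_growing:
    alpha = pm.Normal('alpha', mu=0, sd=20)
    beta = pm.Normal('beta', mu=0, sd=20)
    mewmew = pm.Normal('mewmew', mu=0, sd=20)
    sigma = pm.Uniform('sigma', lower=0, upper=20)

    # let the amplitude of the sine grow with x so the wiggles can span the whole cloud of points
    y_est = alpha + beta * x * pm.sin(mewmew * x)

    likelihood = pm.Normal('y', mu=y_est, sd=sigma, observed=y)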

My brain hurts, let's put a bow on it

With that, we've finished exploring Bayesian regression with PyMC3. We've seen how flexible it is and where some of the difficulties are, and while I still don't REALLY grok it, I'm a bit more comfortable with the library. Hopefully you are too.

The code in this notebook should be everything you need to reproduce all of this, so feel free to knock yourself out!